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Abstract 

Statistically resolving the underlying haplotype pair for a genotype measure- 
ment is an important intermediate step in gene mapping studies, and has 
received much attention recently. Consequently, a variety of methods for this 
problem have been developed. Different methods employ different statistical 
models, and thus implicitly encode different assumptions about the nature of 
the underlying haplotype structure. Depending on the population sample in 
question, their relative performance can vary greatly, and it is unclear which 
method to choose for a particular sample. Instead of choosing a single method, 
we explore combining predictions returned by different methods in a principled 
way, and thereby circumvent the problem of method selection. 

We propose several techniques for combining haplotype reconstructions and 
analyze their computational properties. In an experimental study on real- 
world haplotype data we show that such techniques can provide more accurate 
and robust reconstructions, and are useful for outlier detection. Typically, 
the combined prediction is at least as accurate as or even more accurate than 
the best individual method, effectively circumventing the method selection 
problem. 
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1 Introduction 



Complex diseases such as Diabetes or Alzheimer's disease are often linked to 
individual genetic variations. The analysis of genetic variation in human pop- 
ulations is therefore critical for understanding individual risk factors for such 
diseases. Most of the human genome is invariant among individuals, and it 
is sufficient to concentrate on small parts of the whole genome sequence to 
analyze genetic variation. Frequently studied differences are single nucleotide 
polymorphisms (SNPs), which are single-nucleotide variations at a particular 
location in the genome. The positions in the sequence are called markers 
and the different possible values alleles. A haplotype is a sequence of SNP 
alleles along a chromosome, and concisely represent the variable genetic in- 
formation in that region. In the search for DNA sequence variants which are 
related to common diseases, haplotype-based approaches have become a cen- 
tral theme |The05j . 

Diploid human cells have two homologous (i.e., almost identical) copies of each 
chromosome. Current practical laboratory measurement techniques produce 
a genotype — for m markers, a sequence of m unordered pairs of alleles. A 
genotype reveals the two alleles that are present at each marker, but not their 
respective chromosome origin. To obtain haplotypes from genotype data, this 
hidden phase information has to be reconstructed. Two alternative approaches 
exist: If family trios are available, most of the ambiguity in the haplotype pair 
can be resolved analytically. Otherwise, population-based statistical methods 
have to be used to estimate the haplotype pair. Because trios are more difficult 
to recruit and more expensive to genotype, the population-based haplotyping 
approach is often the only cost-effective method for large-scale studies. 

The haplotyping problem has received a lot of attention recently, and many dif- 
ferent haplotyping methods have been proposed |SS05[ ISS061 IKSOSal lEGTOGj 
IRKMU05] . All of these methods employ different statistical models, which re- 
flect different assumptions about the underlying distribution over haplotypes 
in a population sample. Furthermore, the methods offer different trade-offs 
in terms of reconstruction accuracy and scaling behavior in the number of 
markers and individuals in the sample. On the other hand, the statistical 
properties of haplotype "datasets" (a particular set of markers genotyped for 
a particular set of individuals) vary depending on marker spacing, sample size 
and population characteristics. In fact, some haplotyping methods have been 
specifically tailored to particular dataset characteristics. For example, the 
HIT system |RKMU05] is especially effective for population isolates, and the 
HaploRec system [EGTOGj for reconstruction of large, possibly genome-wide 
marker maps. 

It is therefore unlikely that there is one haplotyping method which is generally 
superior. Instead, the relative performance of different methods will vary de- 
pending on the characteristics of the dataset to be haplotyped. In contrast to 
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other statistical modeling tasks, in haplotyping there is typically no "training 
data" available for which the ground truth is known. This precludes the use of 
model selection techniques such as cross-validation (although it is possible to 
use cross-validation estimates of performance on related tasks such as missing 
genotype imputation for model selection, see e.g. |SS06j ). Nevertheless, one 
often has to commit to just one haplotype reconstruction in the end. Hence, 
it is natural to ask whether the predictions of the different methods could be 
combined in a simple way to give more accurate and robust haplotype recon- 
structions without having to know in advance which of the baseline methods 
performs well on the dataset at hand. 

In this paper we study how to combine haplotype reconstructions produced 
by various methods. We formulate several approaches for combining haplo- 
typers, study the algorithmics of the problem, and experimentally validate 
that combining haplotypers is beneficial. 

2 Population-based haplotyping 

A haplotype h can be represented as a sequence of alleles h[i] in markers 
i = 1, . . . ,m. For most SNP markers, only two alternative nucleotides (alleles) 
occur in a population, so we can assume h G {0, l}*". A genotype g for an indi- 
vidual can be represented as a sequence of unordered pairs g[i] = {h^[i], h'^[i]} 
of alleles in markers i = l,...,m. Hence, g G {{0, 0}, {1, 1}, {0, 1}}™. A 
marker with alleles {0,0} or {1,1} is homozygous whereas a marker with alle- 
les {0, 1} is heterozygous. We denote the number of heterozygous markers by 
m', and their positions in the haplotype sequence by ii, . . . ,im'- 

The haplotyping problem arises from the fact that while each haplotype pair 
corresponds to a unique genotype, a genotype may correspond to a large num- 
ber of different haplotype pairs. Population-based haplotyping is the task of 
statistically resolving this ambiguity: 

The haplotype reconstruction problem: Given a multiset Q of geno- 
types, find for each genotype g E Q the haplotypes /i^ and h"^ that have 
generated g. 

For the rest of the paper we will denote the two individual haplotypes in a 
haplotype pair as and h"^, and use h as a. shorthand to denote the pair 
{/i^,/i^} when there is no ambiguity. Furthermore, we denote a substring 
s[2]s[2 + 1] . . . s[i + fc] of a string s by s[i, k]. 

For each genotype g G {0, 1}™, there are 2"^'"^ different haplotype reconstruc- 
tions. Only one of these reconstructions is correct, so inferring the haplotypes 
is clearly impossible without additional information or assumptions. These 
assumptions are typically inspired by population genetics, and can take either 
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a combinatorial or a probabilistic form. The models borrowed from popula- 
tion genetics are often rather simplistic abstractions of the complicated reality. 
Furthermore, additional simplifications and heuristics may be needed to make 
haplotype inference computationally tractable. The number of ways to com- 
bine these choices — which of the imperfect population genetics models to build 
on and which computational strategies to use — has lead to the development of 
a large and diverse set of different haplotyping methods, each with their own 
advantages. The following lists just a few prominent examples. 

The currently most widely used method Phase |SSD01l ISD031 ISSOSj is based 
on quite sophisticated probabilistic models and is computationally expensive; 
fastPHASE |SS06] . a more efficient but still almost as accurate method has 
been published recently. Several other methods have recently been developed. 
Gerbil |KS05at IKSOSb] is based on reconstructing block partitioning and 
resolving the haplotypes simultaneously. HAP |HE04j implements a method 
based on imperfect phylogeny. HIT |RKMUn5j and HINT [KSOScj use HMM 
founder models for haplotyping. HaploRec |EGT04[ lEGTOGj is based on 
variable-length Markov chains. SpaMM |LME+06l lLME+07] is an approach 
based on levelwise construction of constrained Hidden Markov Models. 

3 Combining haplotypers 

In practice, genetics researchers often face the problem that different haplotype 
reconstruction methods give different results and there is no straightforward 
way to decide which method to choose. Due to the varying characteristics 
of haplotyping datasets, it is unlikely that one haplotyping method is gener- 
ally superior. Instead, different methods have different relative strengths and 
weaknesses, and will fail in different parts of the reconstruction. 

The promise of ensemble methods lies in "averaging out" those errors, as far as 
they are specific to a small subset of methods (rather than a systematic error 
affecting all methods). This intuition can be made precise by making proba- 
bilistic assumptions about how the reconstruction methods err: If the errors 
in the reconstructions were small random perturbations of the true haplotype 
pair, taking a majority vote (in an appropriate sense depending on the type of 
perturbations) of sufficiently many reconstructions would with high probabil- 
ity correct all the errors. While such probabilistic assumptions are not true in 
practice, they serve as a guideline and motivation for the combination methods 
we derive next. 

The idea of using ensemble methods in haplotyping is not entirely new. It is 
used in existing systems for combining results from several random restarts of a 
method |SS06l IGus02] , or to obtain a point estimate from an inferred posterior 
distribution on haplotypes |SSD01] . However, to the best of our knowledge, our 
approach of combining unrelated haplotypers — and thus gaining the benefits 
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of their potentially orthogonal strengths — has not been studied before. 

The haplotyper combination problem can be viewed as an instance of the 
general problem of finding a consensus object for a given collection of ob- 
jects. In the simplest case the objects are individual predictions as in en- 
semble methods in machine learning |Bre96l ICSS021 ISG02] . The objects can 
also be more complicated structures such as sequences [JXL04[ ILP02[ ISP03j , 
rankings |DKNS01l lFKM+04[ IFlSS03j . clusterings |ACN05l IGMT051 [LB05] . 



or segmentations [MTTOGj . Although sequential prediction has been studied a 
lot, there exists little work on ensemble methods for sequence prediction. Our 
approach to haplotyper combination resembles closely the work on combin- 
ing part-of-speech taggers |Sjo03| to improve tagging accuracy. However, due 
to the nature of haplotype data, we need more refined strategies than simple 
position-wise voting. 



3.1 Problem definitions 

To combine the haplotypings suggested by I given baseline haplotype recon- 
struction methods, we formulate two computational problems. We limit our- 
selves to combination methods that process each individual separately, thus 
enabling immediate parallelization of the combination strategies for large pop- 
ulations. 

Problem 1 (Haplotyper combination). Given the haplotype reconstructions 
{h\,hl},...,{h},hf} C {0,1}"^, and a distance function rf: {0, l}™x {0, 1}™ ^ 
M>o, find: 

• HVP: a reconstruction {/i^, /i^} C {0, 1}"^ minimizing the sum of dis- 
tances, i.e., find 

I 

{h\h^}= argmin Y^diihl h^i}, {h\ h^})- 
hlh^e{OA} i=i 

• HSP: a reconstruction {hj, hf}, i G {1, ■ ■ ■ ,1} minimizing the sum of 
distances, i.e., find 

I 

i = argmin\2di{hl,h'^i},{h],h'^A). 
In both cases, ties are broken arbitrarily. 

The difference between the Haplotyper Voting Problem (HVP) and the Hap- 
lotyper Selection Problem (HSP) is that in the latter, the solution is required 
to be one of the input haplotype reconstructions. Using clustering terminol- 
ogy, the Haplotyper Voting Problem (HVP) seeks for the average haplotype 
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reconstruction based on the input haplotypings, whereas the Haplotyper Selec- 
tion Problem (HSP) selects the median haplotype reconstruction as the most 
plausible haplotyping. The exact meaning of average and median depends, of 
course, on the properties of d. Ideally, d should be such that the solutions to 
HVP and HSP can be found efficiently and are close to the unknown true 
haplotypes. We will discuss viable candidates for such d later in Section I3.2[ 
While HSP can be solved efficiently by brute force provided that d can be 
computed efficiently, the computational aspects of HVP depend heavily on d. 
Thus, their discussion will be postponed to Section 13.31 

HVP and HSP are closely related for all distance functions d. A solution to 
HSP is a 2-approximation of a solution to HVP and the solution to HVP can 
be transformed into a 2-approximation of a solution to HSP. 

Proposition 1. Let d be a distance function between haplotype pairs satisfying 
the triangle inequality. Let hi = {h\, hf}, . . . ,hi = {hj, hf} be the haplotype 
pairs to combine and /i-HVP = {^hvp' ^hvp}' ^hsp = {^hsp' ^hsp) optimal 
HVP and HSP solutions. Then 

1. /iHVP is a feasible solution of HVP and 




2=1 



1=1 



1=1 



2. hj = argmin^^i i d{hi, /ihvp) is a feasible solution of HSP and 




Proof. We have J2i=i ^ihi, huyp) < X]2=i d{hi, /ihsp) because 






To see that Y.i=i d{hi, husp) < 2 Y^i- 
haplotype pair hj = {h^j,h'^},j G {1, 



^ d{hi, /iHVp); note that there must be a 
. . , /} such that 




2 = 1 
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Hence, 

I 



d{hi, hj) < ^ d{hi, /iHVp) + ld{hj, /ihvp) < 2 ^ /ihvp)- 

i=l i=l i=l 

Similarly ^'^-^ /ihsp) < ZlLi "^(^i' ^i) because 

I I I 

d{hi, /iHSp) = mill d{hi, hj) < hf) 

1=1 1=1 1=1 

for any j' G {1, . . . ,/}. 

To see that Yl\^i d{hi, hj) < 2 d{hi, /ihsp), note that 

1 ' 

min /i-nvp) ^ T X] ^(^^^ /iRVp)- 

7 = 



Hence, 



i / / 

^d{hi,hj) < '^d{hi,husp) + ^d{hj,huvp) 

i=l i=l i=l 

I 

= ^ husp) + ld{hj, /iHVp) 

i=l 

/ I 

< ^d{hi,hiisp) + ^d{hi,hjiyp) 

i=l i=l 

I 

< 2^d{hi,hnsp)- 



□ 



3.2 Distance functions 

In order to define average and median haplotypings, we need to choose a 
distance function d, for measuring the similarity between haplotype sequences. 
To satisfy the intuition that the solutions to HSP and HVP should be on 
average close to the baseline haplotype reconstructions, we will focus only on 
a small set of distance measures d that are reasonable candidates for measuring 
genetic distance between haplotype pairs. 

Hamming distance and other distances induced by distances on se- 
quences. The most common distance measure between sequences s,t E E"* 
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is the Hamming distance that counts the number of disagreements between s 
and t: 

dHis,t) = \{ie{l,...,m}:s\i]^t[i]}\. 

The Hamming distance is not directly apphcable as a measure of genetic dis- 
tance between individuals, because the haplotypes corresponding to an in- 
dividual's genotype form an unordered pair. To define a Hamming distance 
between unordered pairs of haplotypes, let us consider haplotype pairs {h\,hl} 
and {hi, h\}. The distance between the pairs should be zero if the sets {h\, h\} 
and {h\, h\\ arc the same. Hence, we should try both ways to pair the haplo- 
types and take the one with the smaller distance, i.e., 

dH{{hlhl},{hlhl}) = 

uAu{dH{h\, h\) + dnihl, hi), dnihl, hi) + dnih^ hi)} . 

Note that a similar construction can be used to map any distance function 
between haplotype sequences to a distance function between pairs of haplo- 
typings. Furthermore, the next proposition shows that if the distance function 
between the sequences satisfies the triangle inequality, so does the correspond- 
ing distance function for haplotype reconstructions. 

Proposition 2. Let d: x E"* — > R>o he a distance function between se- 
quences of length E"* and let 

d{{h\, h\}, {hi, hi}) = mm{d{h\, hi) + d{hl, hi), d{h\, hi) + d{hl, hi)} 

for all h\,h\,h\,hl G S™. If d satisfies the triangle inequality for comparing 
sequences, i.e., d{s,t) < d{s,u) + d(t,u) for all s,t,u G E"', then d satisfies 
the triangle inequality for comparing unordered pairs of sequences d{hi, /i2) < 
d{hi, hs) + d{h2, hz) for all h\, hi, h\, hi, h\, hi e TT. 

Proof. Choose arbitrary sequences h\,h\,h\,hl,h\,h\ G E™". We show that 
the claim holds for them and hence for all sequences of length m over the 
alphabet E. 

Assume, without loss of generality, that d{{h\,h\},{hl,hl}) = d{h\,hl) + 
d{hl hi) and d{{hl, hj}, {hi, hi}) = d{hl, hi) + d{hl hi). 

For d{{hl, hi}, {hi, hi}) there are two cases as it is the minimum of d{hl, hi) + 
d(/ii,/ii)and d{hl,hl) + d{hl,hl). 

If dahl, hi}, {hi, hi}) = dihl, hi) + d{hl, hi), then 

d{{h\, hi), {h\, hi}) + d{{hl, hi), {h\, hi}) 
= d{h\, hi) + dihl, hi) + dihl, hi) + dihl, hi) 
= [dihl, hi) + dihl, hi)] + [dihl, hi) + dihl, hi)] 
> dihl,hl) + dihl,hl). 
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If d{{hi hi), {hi hi}) = d{hl hi) + d{hi hi), then 

dahl, hi}, {hi, hi}) + d{{hl, hi}, {hi, hi}) 
= d{hl, hi) + d{h\, hi) + d{hl, hi) + d{h\, hi) 
= [d{h\, hi) + d{hl, hi)] + [d{hl, hi) + d{hl, hi)] 
> d{hl, hi) + d{h\, hi) > d{hl, hi) + d{h\, hi). 

Thus, the claim holds. □ 



Switch distance. The approach of defining distance functions between hap- 
lotype pairs based on distance functions between haplotypes has some limita- 
tions, independently of the distance function used. This is because much of the 
variance in haplotypes originates from chromosomal crossover during meiosis, 
which breaks up chromosomes and reconnects the resulting segments to form 
new chromosomes for the offspring. The chromosome pair resulting from a 
crossover could be seen as genetically close to the original pair even if the 
individual sequences do not match very well. Switch distance is a distance 
measure for haplotype pairs that takes such similarities into account. It is 
defined as the number of switches that are needed to transform a haplotype 
pair to another haplotype pair with the same homozygous and heterozygous 
markers. A switch between markers i and i + 1 for a haplotype pair {h^, h^} 
transforms the pair = {h^[l,i]h^[i + l,m], h'^[l,i]h'^[i + l,m]} into the 

pair {h^[l, i]h'^[i + 1, m], h'^[l, i]h^[i + 1, m]} . It is easy to see that for any pair 
of haplotype reconstructions corresponding to the same genotype, there is a 
sequence of switches transforming one into the other. Thus, switch distance is 
well defined for the cases wc arc interested in. 

The switch distance has the advantage over the Hamming distance that the 
order of the haplotypes in the haplotype pair does not matter in the distance 
computation: the haplotype pair can be encoded uniquely as a bit sequence 
consisting of just the switches between the consecutive heterozygous markers, 
i.e., as a switch sequence: 

Definition 1 (Switch sequence). Let h^,h'^ e {0, 1}"* and let ii < . . . < im' 

be the heterozygous markers in {/i^, /i^}. The switch sequence of a haplotype 
pair {h^, h'^} is a sequence s{h^, K^) = s{h'^, h^) = s e {0, l}"^'-^ such that 

,r,i _ / if h^ih] = and h''[ij] = h^[ij+i] 

"^^J \ 1 if h%] ^ h%^^] and h%] ^ h\i^^,\ 

The switch distance between haplotype reconstructions can be defined in terms 
of the Hamming distance between switch sequences as follows. 

Definition 2 (Switch distance). Let h\ = {h\,hl} and /12 = {hi, hi} be 
haplotype pairs corresponding to the same genotype. The switch distance 
between the pairs is ds{hi, /i2) = dH{s{h\, hf), s{hl, h^)). 



3 COMBINING HAPLOTYPERS 



9 



As switch distance is the Hamming distance between the switch sequences, the 
following proposition is immediate: 

Proposition 3. The switch distance satisfies the triangle inequality. 

fc-Hamming distance. Switch distance considers only a very small neigh- 
borhood of each marker, namely only the previous and the next heterozygous 
marker in the haplotype. On the other extreme, the Hamming distance uses the 
complete neighborhood (via the min operation), i.e., the whole haplotypes for 
each marker. The intermediate cases are covered by the following fc-Hamming 
distance in which all windows of a chosen length k G {2, . . . , m} are consid- 
ered. The intuition behind the definition is that each window of length A; is a 
potential location for a gene, and we want to measure how close the haplotype 
reconstruction gets to the true haplotype {hl,hl} in predicting each 

of these potential genes. 

Definition 3 (/c-Hamming distance). Let {h\,hi} and {/ig, /i^} t>e pairs of 
haplotype sequences corresponding to the same genotype with m' heterozy- 
gous markers in positions ii, . . . ,im- The fc-Hamming distance d^-H between 
{h\,hl} and {hl,h2} is defined by 



unless m' < k, in which case dk-uihi, = duihi, h2). 

It is easy to see that d2-H = '2ds, and that for haplotyping pairs with m' het- 
erozygous markers, we have d^'-H = dm-H = dn- Thus, the switch distance 
and the Hamming distance are the two extreme cases between which dk-H 
interpolates for k = 2, . . . ,m. 

3.3 Algorithms and complexity 

The HSP problem is easily solved by trying out each of the / reconstructions 
as a candidate solution, and choosing the best one. The complexity of this 
straightforward strategy is roughly times the time needed to evaluate the 
distance function d. Thus, there seems to be no need for more efficient algo- 
rithms for HSP in practice. 

The complexity of HVP depends on ci in a more involved way. As we will show 
next, for d = ds a simple voting scheme gives the solution. The rest of the 
distances considered in Section 13.21 are more challenging. If ci = dk_H and k is 
small, the solution can be found by dynamic programming. For d = dk-n with 
large k and ci = c?//, we are aware of no efficient general solutions. However, 
we will outline methods that can solve most of the problem instances that one 
may encounter in practice. 
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Switch distance: d = ds- For the switch distance, the solution to HVP 
can be found by the following voting scheme: 

• Transform the haplotype reconstructions {h}, hf} C {Q, 1}™, i = 1, . . . , / 
into switch sequences si, . . . , e {0, 1}"*'"^. 

• Return the pair that shares the homozygous markers with the 

reconstructions {h},h'^} and whose switch sequence s e {0,1}"*'"^ is 

defined by = argmax |{j e {1, . . . , m' — 1} : Si\j] — b}\ . 
be{o,i} 

The time complexity of this method is 0{lm). 

/c-Hamming distance: d — d^-H ■ The optimal solution /ihvp — {^hvp' ^hvp} 
of HVP is given by 

I 

/iHVP = argmin y^dk-H{hi,h). 

{h\h^}<Z{0,l}-^ 

The number of potentially optimal solutions is 2™', but the solution can be 
constructed incrementally based on the following observation: 

I 

hnvp = argmin^ dk-H {hi, h) 
{h\h^} 

I m'-k+l 

= argminy] dnihilij, ...,ij+k-i], h[ij, ...,ij+k-i]) 

{h^,h^} 7^1 

Hence, the cost of any solution is a sum of terms 

I 

Dj{{x,x}) ^ dnihilij, ... ,ij+k-i\,{x,x}), 

i=l 

j = 1, . . . ,m' — k + 1, X & {0, l}'^ and x denotes the complement of x. There 
are (m' — k + l)2''-^ such t erms. Furthermore, the cost of the optimal solution 
can be computed by dynamic programming using the recurrence relation 

r if J = 

Tj{{x,x}) = < DJ{x,x}) +minTi_i({bx,bx}) if ? >0 
l_ 6g{o,i} 

Namely, the cost of the optimal solution is mm^^^Q iykTjn'{{x,x}) and the 
optimal solution itself can be reconstructed by backtracking the path that leads 
to this position. The total time complexity for finding the optimal solution 
using dynamic programming is 0{lm-\-2^kl{m' — k)): the heterozygous markers 
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can be detected and the data can be projected onto them in time 0(lm), and 
the optimal haplotype reconstruction for the projected data can be computed 
in time 0{2'^kl{m' — k)). So the problem is fixed-parameter tractabl^ in k. 



Hamming distance: d = dn- An ordering {h^^K^) of an optimal solu- 
tion {h^,h?} to HVP with Hamming distance determines an ordering of the 
unordered input haplotype pairs {h\,h\}, . . . ,{h\,h^}. This ordering can 
be represented by a binary vector o = {oi,...,oi) G {0,1}^ that states for 
each i = that the ordering of {h},hf} is {h\'^°\h1~°'). Thus, Oj = 

argminf^^_^Q^^'^ dnih^, h}^''), where ties are broken arbitrarily. 

If the ordering o is known and / is odd, the optimal haplotype reconstruction 
can be determined in time 0{lm) using the formulae 



h^[i] = argmax 
be{o,i} 



{je{l,...,l}:h]^^^[^=b]\ (1) 



and 

h'^\i] = argmax = |j G {1, ...,/} : h^'"' [i] = b] . (2) 
f)e{o,i} J 

Hence, solving HVP is polynomial-time equivalent to the task of determin- 
ing the ordering vector o corresponding to the best haplotype reconstruction 
{h\h^}. 

The straightforward way to find the optimal ordering is to evaluate the quality 
of each of the 2'^^ non-equivalent orderings. The quality of a single ordering 
can be evaluated in time 0{lm). Hence, the HVP problem can be solved in 
total time 0{lm + 2Hm'). The runtime can be reduced to 0{lm + 2'm') by 
using Gray codes |Sav97j to enumerate all bit vectors o in such order that 
consecutive bit vectors differ only by one bit. Hence, the problem is fixed- 
parameter tractable in I (i.e., in the number of methods). If / is large, however, 
a more clever strategy is needed. We are unaware of a tractable efficient 
general solution and suspect that HVP for d = d^ is NP-complete in general. 
However, we have efficient solutions to two special cases of practical relevance: 

Small number of heterozygous markers. If the number of heterozygous posi- 
tions m' is small, we can simply enumerate all the 2™ ~^ non-equivalent possible 
solutions to the problem, and pick the optimal one from among them. The 
time complexity of this approach is 0{2"^ Im'). Thus, the problem is fixed- 
parameter tractable also in m' (the number of heterozygous markers). 

All reconstructions close to the optimal solution for HVP. Fixing an ordering 
to any one of the input haplotype reconstructions {h},hf} induces an ordering 



problem is called fixed-parameter tractable in a parameter k, if the running time of 
the algorithm is f{k)0{n'^) where k is some parameter of the input and c is a constant (and 
hence not depending on k.) For a good introduction to fixed-parameter tractability and 
parameterized complexity, see |FG06) . 
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to the remaining input haplotypes. This ordering can be used to compute 
a solution to HVP through equations [T] and O The next proposition shows 
that the solution obtained in this way is provably optimal if all input haplotype 
reconstructions are within m'/2 of the optimal solution {/invP' ^hvp) HVP. 

Proposition 4. If dniih}, hf}, {/^hvp' ^hvpI) < m' /2 for each z G {1, . . . , /}, 

then the ordering induced by any of the input haplotype pairs is equivalent to 
the ordering corresponding to the optimal solution to HVP. 

Proof. By assumption, dnih]^"', ^hvp) < ""^'/^ for each i = 1, . . . , / for one of 
the choices Oj G {0, 1}. Then 

dH{hl^"\ h]"-"^) < dHihl+"\ /i^vp) + dH{h]^°\hl,yp) < m'/A + m'/A = m'/2. 

Thus, if we use {h\'^°\ h?~°^) as a reference point, the induced ordering for 
the haplotypes will be the same o that is induced by using (^HVP' ^Hvp) ^ 
reference point. Switching the ordering in the reference point to (hj~°% hj^"^) 
will induce the equivalent ordering 1 — o. □ 

4 Experiments 

To investigate the haplotype reconstruction combination problem empirically, 
real-world genotype data was phased with different haplotyping systems and 
their reconstructions evaluated. The data was obtained from three sources: a 
collection of datasets from the Yoruba population in Ibadan, Nigeria |The05] . 
the well-known dataset derived from a European population of Daly et al. [DRS"'"Ol"] . 
and samples from the recently published D-HaploDB haplotype database |HMK"'"07] 
derived from a Japanese population. For the Yoruba and Daly data, true hap- 
lotype pairs were inferred from family trios. Furthermore, the nontransmitted 
parental chromosomes of each trio were combined to form additional artificial 
haplotype pairs. For the HaploDB dataset, definite haplotypes were deter- 
mined from complete hydatidiform moles (CHMs). The 74 available CHMs 
haplotypes were paired to form 37 diploid individuals. 

For all datasets markers with minor allele frequency of less than 5% and geno- 
types with more than 15% missing values were removed. For the Yoruba 
population, information on 3.8 million SNPs spread over the whole genome is 
available. We sampled 100 sets of 100 markers each from distinct regions on 
chromosome 1. There are 60 individuals in these datasets after preprocessing 
as described above, with an average fraction of missing values of 3.6% and 
32.2% heterozygous markers. For the Daly dataset, there is information on 
103 markers and 174 individuals available after data preprocessing, the aver- 
age fraction of missing values is 7.9% and the average fraction of heterozygous 
markers is 30.6%. In HaploDB, a genome-wide set of 281 439 SNP markers 
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Table 1: Switch (top-right triangle) and Hamming (bottom-left triangle) dis- 
tances between the truth and the baseline methods for the Daly dataset. 
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HR 


736 


676 


784 


716 




119 


150 


G 
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546 
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654 


590 


728 
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850 
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is available, from which we sampled 100 sets of 100 markers each from dis- 
tinct regions on chromosome 1. The average fraction of missing values is 3.1% 
and the fraction of heterozygous markers is 39.9%. All datasets were phased 
with each of the following 6 publicly available haplotyping systems, yielding 
6 different reconstructed haplotype pairs for every genotype: Phase version 
2.1.1. }SS05] . fastPHASE version 1.1. |SS06] . Gerbil as included in Gevalt 
version 1.0. |KS05a] . HIT |RKMIJ05] . HaploRec version 2.0. |EGT06j and 
SpaMM version 1.0. |LME+06j . All methods were run using their default 
parameters. 

Let us first consider how the reconstructions produced by the baseline methods 
differ on the Daly dataset. Table [T] shows the switch and Hamming distances 
between the different haplotype reconstructions, including the reconstructions 
inferred from the family trios (trio) as the ground truth, and the methods 
fastPHASE (fP), HIT (HIT), SpaMM (S), HaploRec (HR), Gerbil (G), 
and Phase (P). The fastPHASE system clearly has the smallest reconstruction 
error with respect to switch and Hamming distances on the Daly dataset. 
While the accuracy performance of the other methods is worse, the distances 
between all the methods are of the same order of magnitude. This indicates 
that it makes sense to try to combine the haplotypers. 

We tested the haplotyper selection and voting techniques using the Daly, 
Yoruba and HaploDB datasets. As it is not clear which combination of se- 
lection/voting and internal distance measure (switch distance, Hamming dis- 
tance, fc-Hamming distance) yields best results, systematic experiments using 
all different combinations were performed. The quality of the resulting recon- 
structions is measured by switch distance only, as this is the standard way of 
measuring the quality of reconstructions in haplotyping experiments. 

The main goals of the experimental study are as follows. First, the goal is to 
evaluate whether the simple combination approaches like selection and voting 
can be used to find a more robust solution when the best-performing method 
is not known. Second, the goal is to see whether the combination methods 
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Table 2: The total switch error between true haplotypes and the haplotype 
reconstructions over all individuals for the baseline methods. For Yoruba and 
HaploDB, the reported numbers are averages over the 100 datasets. 



Method 


Daly 


Yoruba 


HaploDB 


Phase 


145 


37.61 


108.36 


fastPHASE 


105 


45.87 


110.45 


SpaMM 


127 


54.69 


120.29 


HaploRec 


131 


56.62 


130.28 


HIT 


121 


73.23 


123.95 


Gerbil 


132 


75.05 


134.22 



improve over the baseline methods when using different subsets of the base- 
line methods. For this purpose, we consider leaving out one of the baseline 
methods Phase (the most accurate on the Yoruba and HaploDB datasets on 
average and on the HaploDB dataset), fastPHASE (most accurate on the Daly 
dataset), and Gerbil (slow and least accurate on all datasets), and also leav- 
ing out all three of them simultaneously. The results using these subsets are 
representative of results for other subsets we experimented with but do not 
report on here. Third, the goal is to find out how the haplotyper selection 
results compare to haplotyper voting results and how the different distance 
functions affect the quality of the solutions. 

The results for the baseline methods are summarized in Table [2] and results for 
the combination methods in Tableland Table HI Let us first consider the Daly 
dataset. The best baseline method is fastPHASE, resulting in 105 switch errors. 
The selection and voting methods applied to the set of all baseline methods 
produce results comparable to fastPHASE, and are consistently better than the 
haplotype reconstructions produced by any other baseline method. Thus, by 
employing the haplotyper combination approach, we can achieve performance 
comparable to the best baseline method without having to know which of 
the baseline methods is best in advance. Leaving out one of the methods 
Phase, fastPHASE, or Gerbil has no significant effects on the results of the 
combination methods. Thus, the combination methods seem to be quite robust 
against small perturbations of the set of baseline methods, even if they lead 
to the exclusion of the best performing method. If Phase, fastPHASE, and 
Gerbil are left out simultaneously, the results degrade below the level of 
fastPHASE, but are still better than those of any other method. 

The results on the Yoruba datasets follow a similar pattern, except that now 
PHASE-the baseline method with worst performance on the Daly dataset — is 
the best on average. The combination methods provide solutions comparable 
to those of the best method (Phase) and better than those of any other 
baseline method. When only subsets of the baseline methods are used, the 
performance of the combination methods drops, but not significantly unless 
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Table 3: The total switch error between true haplotypes and the haplotype 
reconstructions over all individuals for the haplotyper selection methods for 
different combinations of baseline haplotypers. For Yoruba and HaploDB, the 
reported numbers are averages over the 100 datasets. 



Haplotyper Selection 



Methods 


Distance 


Daly 


Yoruba 


HaploDB 


all methods 


ds 


1 no 

103 


37.67 


1 no AO 

103.43 
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114 


41.59 


107.32 




dn 


115 


45.30 


116.27 
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38.47 
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105 


38.53 


104.69 
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105.52 
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118 


39.07 


106.05 




dn 


113 
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111.57 


w/o P, fP, G 


ds 


116 


47.94 


113.95 
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114.39 
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48.48 


115.57 
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116.57 
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53.47 
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Table 4: The total switch error between true haplotypes and the haplotype 
reconstructions over all individuals for the haplotyper voting methods for dif- 
ferent combinations of baseline haplotypers. For Yoruba and HaploDB, the 
reported numbers are averages over the 100 datasets. 



Haplotyper Voting 



Methods 


Distance 
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all methods 


ds 
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Phase, fastPHASE, and Gerbil are left out simultaneously. 

On the HaploDB datasct the advantage of using combination methods is even 
more evident. The best baseline method PHASE is clearly outperformed by 
all combination methods except voting and selection with Hamming distance. 
The results are only slightly degraded when Phase, fastPHASE, or Gerbil 
is left out of the ensemble, and when they are all left out simultaneously, the 
performance of the combination strategies is still significantly better than that 
of the best remaining baseline method SpaMM. 

In summary, our results indicate that using the haplotyper combination ap- 
proach sometimes significantly increases the haplotyping accuracy, and never 
significantly decreases the accuracy in comparison to the best baseline method. 
Of course, in practice the identity of the best baseline method is not known 
and changes from dataset to dataset. In a more realistic comparison to the 
baseline method that does best on average on all the datasets (fastPHASE), all 
the proposed haplotyper combination methods are clearly more accurate on 
average. Hence, our experiments suggest that it is indeed better to combine 
the predictions of all the basehne methods than to (blindly) choose and use 
any one of them. 

In general, combination methods using switch distance as the distance func- 
tion tend to produce most accurate results. This suggests that the errors of 
the baseline methods resemble random switches rather than random single 
nucleotide mutations. The performance of different distance functions also 
depends on the density of the used marker map. For dense marker maps, 
larger windows are beneficial, whereas in sparse maps considering dependen- 
cies between consecutive markers probably suffices. Furthermore, the selection 
methods seem to perform slightly better than voting methods. A potential 
explanation is that the median haplotype reconstruction is more tolerant to 
random errors in the baseline methods than the mean haplotype reconstruc- 
tion. Further analysis is needed in order to fully understand the differences 
between the combination methods, but it seems safe to conclude that hap- 
lotyper selection with switch distance is the best choice (among combination 
methods and baseline methods) at least when no additional information abut 
the problem at hand is available. 

The computational price for the potential improvements in accuracy is the 
added effort of first running all the baseline methods and then solving the HVP 
or HSP problem. This may be a problem if some of the basehne methods are 
very slow. In such cases, we suggest the strategy of computing the predictions 
of as many baseline methods as time constraints permit, and combining the 
resulting reconstructions using one of the combination methods. The running 
times of the basehne methods vary greatly, so running, e.g., all but the slowest 
baseline method may well be much more efficient than running the slowest 
method alone. 
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Figure 1: The relative Hamming distance vs the number of heterozygous mark- 
ers in the Daly dataset. The plot shows that the relative error being higher 
than 1/2 is mainly the problem of individuals with small number of heterozy- 
gous markers. The fraction of individuals with the relative hamming error at 
least 1/2 is 20/174 ^ 11.5%. 

A major computational difficulty with Hamming voting is that the basic method 
for computing it scales exponentially in the number of haplotype reconstruc- 
tions per individual. In Section [3^ we showed that if the number of heterozy- 
gous markers is small or the relative Hamming score of the solution is at most 
1/2 then the ordering of the haplotypes in the pairs can be determined effi- 
ciently. Figure [1] illustrates that in the Daly dataset most of the individuals 
have either very small number of heterozygous markers or small relative Ham- 
ming error. This supports the hypothesis that the Hamming voting problem 
can be solved sufficiently efficiently in practice even with a larger number of 
baseline methods. 

Depending on the combination method there can be multiple solutions that 
have the same score but different distance to the ground truth. In Table O 
and Table H] ties are broken by selecting the solution with optimal score that 
was found first. The rule for breaking ties may have a significant effect on the 
final accuracy: For example, when applying switch voting to all the 6 methods 
on the Daly dataset, there are a total of 35 ties. Thus, depending on how 
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Figure 2: The switch error of PHASE vs the sum of the switch distances be- 
tween the basehne methods for Yoruba datasets. Each point corresponds to 
one of the Yoruba datasets, x-coordinate being the sum of distances between 
the reconstructions obtained by the basehne methods, and y-coordinate corre- 
sponding to the switch errors of the reconstructions by Phase. 

ties are broken, the error may be anywhere between 89 and 124. Breaking 
the ties randomly results on average in 106.5 errors, which is quite close to 
the result 105 obtained by our arbitrary tie breaking. The situation with 
other combination methods and datasets is similar. Better results might be 
obtained by more sophisticated tie breaking rules, e.g., by following the overall 
leader fastPHASE in case of ties. We leave the exploration of such advanced 
tie breaking rules for future work. 

When the experimental results are analyzed in detail at the level of individu- 
als, it turns out that the combination methods tend to fail only when many of 
the baseline methods perform rather badly. Even though it is hard to recover 
the true haplotypes in such cases, the fact that the baseline methods are in 
wide disagreement can be used to identify such problematic individuals. We 
have observed that the sum of distances between the baseline haplotype recon- 
structions has very high correlation (between 0.95 and 0.99) with the error in 
the final reconstructions for the individual. Figure [2] illustrates this for Phase 
using the Yoruba datasets. This indicates that combining haplotypers can 
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also be a strong method for outlier detection, which is helpful for removing 
probably incorrectly haplotyped individuals from further consideration. 

5 Conclusions 

Haplotype reconstruction is an important intermediate task in the study of ge- 
netic variations in human populations. Various techniques to reconstruct hap- 
lotypes from measurable genotype data have been proposed. Different methods 
typically return substantially different reconstructions, and there seems to be 
no method that is generally superior on all datasets. 

To overcome these difficulties, we have studied the problem of combining hap- 
lotypers in order to improve the haplotype reconstructions. More specifically, 
we have considered two variants of the problem: haplotyper voting, where the 
goal is to find a consensus haplotype reconstruction given multiple haplotype 
reconstructions, and haplotyper selection, where the goal is to find the best 
haplotype reconstruction for each individual. We have developed algorithms 
for using various internal distance functions. The experiments show that com- 
bining haplotypers provides improvements over the average performance of 
the haplotype reconstruction methods, and the reconstruction quality is even 
comparable to or better than the best method for each dataset. Hence, using 
the combination methods virtually never degrades the performance, and some- 
times gives clear advantages on accuracy in comparison to the best baseline 
method. 

According to the experiments, haplotyper selection with switch distance is 
consistently close to the best combination method. Thus, the original problem 
of having to choose a baseline methods has not been hfted to an analogous 
problem of choosing a combination method as haplotyper selection with switch 
distance seems to be a good choice always. 

Combining haplotypers opens many avenues to improved techniques for hap- 
lotype reconstruction. First, an obvious direction of refinements would be to 
use more complex combinator functions. For example, there could be a-priori 
knowledge about the performance of the methods on some part of the data or 
on other, similar datasets. This knowledge could be used to reweight methods 
in the combination algorithms, or pursue more complex prediction approaches 
such as decision trees or support vector machines. Such approaches would be 
especially useful for combining a large number of reconstructions of varying 
quality. Second, the different methods could be used to guide haplotype re- 
construction techniques, e.g., to detect potentially problematic regions of the 
data where the reconstruction model should be refined. Third, assessing the 
quality of haplotype reconstructions by combining haplotypers is a promising 
direction. Often it will be acceptable to discard part of the reconstructed 
haplotypes or markers to avoid errors. Our preliminary results suggest that 
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multiple haplotypc reconstructions could be used to detect individuals which 
are likely to be haplotyped erroneously and even problematic regions of the 
marker map. Fourth, more refined measures of the reconstruction quality are 
also of interest, for example modeling the dependence structure of the markers 
in more detail or taking genetic background information into account. Finally, 
we intend to evaluate the approach with further genotype datasets with known 
haplotypes as they become available. 
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